Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Sequencing and Raw Sequence Data Quality Control

command (make sure that the SRA Toolkit is installed on your computer and that its programs are on your PATH):

mkdir fastqs

cd fastqs

mkdir single

cd single

fasterq-dump --verbose SRR030834
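
The requirement that the tool be on the path can be checked before the download starts. The following is a minimal sketch, not part of the SRA Toolkit: "require_tool" is a hypothetical helper name introduced here for illustration, and it relies only on the standard shell built-in "command -v".

```shell
# Hypothetical helper: succeeds only if the named program is found on the PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (install it or add its directory to PATH)" >&2
        return 1
    fi
}

# Run the download only when fasterq-dump is available.
# The subshell keeps the directory change from affecting the calling shell.
if require_tool fasterq-dump; then
    (
        mkdir -p fastqs/single
        cd fastqs/single || exit 1
        fasterq-dump --verbose SRR030834
    )
fi
```

If the tool is reported as missing, the usual remedy is to add the directory that contains the SRA Toolkit executables to the PATH variable and open a new terminal.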

As shown in Figure 1.7, the FASTQ file “SRR030834.fastq” has been downloaded to the current directory. We will use this file to demonstrate several Linux commands for working with FASTQ files.

FASTQ files may contain millions of entries, and their sizes can range from several megabytes to several gigabytes, which often makes them too large to open in a normal text editor. In general, there is no need to open a FASTQ file unless it is necessary for troubleshooting or out of curiosity. To display a large FASTQ file, we can use Unix or Linux commands such as “less” or “more”, which display a very large text file page by page, or “cat”, which prints the entire content of the file at once.

less SRR030834.fastq

more SRR030834.fastq

cat SRR030834.fastq

If a FASTQ file name ends with the “.gz” extension, the file is compressed with the “gzip” program. In this case, use the “zless”, “zmore”, and “zcat” commands instead of “less”, “more”, and “cat”, respectively; they display the content without decompressing the file on disk.
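
As a minimal sketch of reading a compressed file without decompressing it on disk, the following builds a tiny FASTQ file, compresses it, and prints its first record with “zcat”. The file name "demo.fastq" and its two reads are made up for illustration only, so that the example runs without the real SRR030834 data.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# demo.fastq is a made-up two-read file, used only for illustration.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > demo.fastq
gzip demo.fastq            # replaces demo.fastq with demo.fastq.gz

# Stream the decompressed text; the .gz file on disk is left unchanged.
first_record=$(zcat demo.fastq.gz | head -n 4)
echo "$first_record"
```

On systems where “zcat” does not accept “.gz” files (macOS, for example), “gzip -dc demo.fastq.gz” produces the same output.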

We can also use “head” and “tail” to display the first and last lines of a file, respectively. The following command displays the first 15 lines of the file:

head -n 15 SRR030834.fastq
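
Because a FASTQ record in the common unwrapped layout occupies four lines, asking “head” or “tail” for a multiple of four lines shows whole records. The sketch below assumes that four-line layout (it would miscount files with wrapped sequence lines) and uses a made-up file, "demo.fastq", so that it runs without the real data; it prints the last record and estimates the number of reads from the line count.

```shell
# Work in a scratch directory with a made-up three-read file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n@read3\nGGCA\n+\nIIII\n' > demo.fastq

# Last record: the final four lines of the file.
last_record=$(tail -n 4 demo.fastq)
echo "$last_record"

# Number of reads = number of lines / 4 (valid only for four-line records).
line_count=$(wc -l < demo.fastq)
read_count=$((line_count / 4))
echo "reads: $read_count"
```

The same two commands can be applied to “SRR030834.fastq” by replacing the demo file name.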

If a FASTQ file is large, we can compress it with the “gzip” program to reduce its size by more than a factor of three. Compressing the “SRR030834.fastq” file with gzip will reduce its size to less than one gigabyte.

gzip SRR030834.fastq

The file name will become “SRR030834.fastq.gz”.
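
The effect of compression can be checked by comparing file sizes before and after running “gzip”. The following is a minimal sketch on a made-up file ("demo.fastq", generated in a loop purely for illustration); it also shows “gzip -t”, which tests the integrity of the compressed file, and “gunzip”, which restores the original.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a made-up, highly repetitive FASTQ file with 1000 reads.
i=1
while [ "$i" -le 1000 ]; do
    printf '@read%s\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n' "$i"
    i=$((i + 1))
done > demo.fastq

original_bytes=$(wc -c < demo.fastq)
gzip demo.fastq                      # produces demo.fastq.gz, removes demo.fastq
compressed_bytes=$(wc -c < demo.fastq.gz)
echo "original: $original_bytes bytes, compressed: $compressed_bytes bytes"

# gzip -t checks the compressed file's integrity without writing any output file.
gzip -t demo.fastq.gz && echo "archive OK"

# gunzip restores the plain-text file if it is needed again.
gunzip demo.fastq.gz
```

The repetitive demo data compresses far better than real reads would, so the ratio printed here says nothing about the ratio to expect for real FASTQ files.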

FIGURE 1.7 Downloading a FASTQ file from the NCBI SRA database.